Example of the ml_regressor() function

Author: Charles Le Losq

The rampy.mlregressor class performs automatic data scaling, hyperparameter grid search and provides access to popular algorithms (Support Vector, Kernel Rdige, Neural Nets) for using machine learning in a regression task implying spectroscopic data. This function allows one to link any variable to a set of spectra with using a machine learning technique.

Let's assume for the sack of example that we observe spectra D that are the combination of $k$ endmember spectra S with concentrations C, such that:

$$ D_{i,j} = C_{i,k} \times S_{k,j} $$

Here we assume a linear combination. In Python, assuming that the partial spectra are simple Gaussians, we can write the following.


In [1]:
%matplotlib inline
import numpy as np
np.random.seed(42) # fixing the seed
import matplotlib
import matplotlib.pyplot as plt
import rampy as rp
from scipy.stats import norm

In [2]:
x = np.arange(0,600,1.0)
nb_samples = 300 # number of samples in our dataset

S_1 = norm.pdf(x,loc=200.,scale=130.)
S_2 = norm.pdf(x,loc=400,scale=70)
S_true = np.vstack((S_1,S_2))
print("Number of samples:"+str(nb_samples))
print("Shape of partial spectra matrix:"+str(S_true.shape))

C_ = np.random.rand(nb_samples) #60 samples with random concentrations between 0 and 1
C_true = np.vstack((C_,(1-C_))).T
print("Shape of concentration matrix:"+str(C_true.shape))


Number of samples:300
Shape of partial spectra matrix:(2, 600)
Shape of concentration matrix:(300, 2)

We make some observations with random noise


In [3]:
Obs = np.dot(C_true,S_true) + np.random.randn(nb_samples,len(x))*1e-4

# norm is a class which, when called, can normalize data into the
# [0.0, 1.0] interval.
norm = matplotlib.colors.Normalize(
    vmin=np.min(C_),
    vmax=np.max(C_))

# choose a colormap
c_m = matplotlib.cm.jet

# create a ScalarMappable and initialize a data structure
s_m = matplotlib.cm.ScalarMappable(cmap=c_m, norm=norm)
s_m.set_array([])

# plotting spectra
# calling the ScalarMappable that was initialised with c_m and norm
for i in range(C_.shape[0]):
    plt.plot(x,
             Obs[i,:].T,
             color=s_m.to_rgba(C_[i]))

# we plot the colorbar, using again our
# ScalarMappable
c_bar = plt.colorbar(s_m)
c_bar.set_label(r"C_")

plt.xlabel('X')
plt.ylabel('Y')
plt.show()


Machine Learning

We can train a machine learning algorithm to follow changes in D as a function of C, assuming we measured both quantities.

Rampy uses scikit_learn algorithms to do so in a easy way, with under-the-hood standardization and cross-validation. This is to use for "simple" projects, while more complicated things will required an ad hoc approach, directly using scikit-learn or any other relevant ML library.

For now, we will show how we can train neural networks to link D and C, such that we can predict C from new observations of D.

Let's first print the help to read the documentation...


In [4]:
### TO ADD PRIOR TO RELEASE
help(rp.mlregressor)


Help on class mlregressor in module rampy.ml_regressor:

class mlregressor(builtins.object)
 |  use machine learning algorithms from scikit learn to perform regression between spectra and an observed variable.
 |  
 |  
 |  Attributes
 |  ----------
 |  x : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |      Spectra; n_features = n_frequencies.
 |  y : array, shape = (n_samples,)
 |      Returns predicted values.
 |  X_test : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |      spectra organised in rows (1 row = one spectrum) that you want to use as a testing dataset. THose spectra should not be present in the x (training) dataset. The spectra should share a common X axis.
 |  y_test : array, shape = (n_samples,)
 |      the target that you want to use as a testing dataset. Those targets should not be present in the y (training) dataset.
 |  algorithm : String,
 |      "KernelRidge", "SVM", "LinearRegression", "Lasso", "ElasticNet", "NeuralNet", "BaggingNeuralNet", default = "SVM"
 |  scaling : Bool
 |      True or False. If True, data will be scaled during fitting and prediction with the requested scaler (see below),
 |  scaler : String
 |      the type of scaling performed. Choose between MinMaxScaler or StandardScaler, see http://scikit-learn.org/stable/modules/preprocessing.html for details. Default = "MinMaxScaler".
 |  test_size : float
 |      the fraction of the dataset to use as a testing dataset; only used if X_test and y_test are not provided.
 |  rand_state : Float64
 |      the random seed that is used for reproductibility of the results. Default = 42.
 |  param_kr : Dictionary
 |      contain the values of the hyperparameters that should be provided to KernelRidge and GridSearch for the Kernel Ridge regression algorithm.
 |  param_svm : Dictionary
 |      containg the values of the hyperparameters that should be provided to SVM and GridSearch for the Support Vector regression algorithm.
 |  param_neurons : Dictionary
 |      contains the parameters for the Neural Network (MLPregressor model in sklearn).
 |      Default= dict(hidden_layer_sizes=(3,),solver = 'lbfgs',activation='relu',early_stopping=True)
 |  param_bagging : Dictionary
 |      contains the parameters for the BaggingRegressor sklearn function that uses a MLPregressor base method.
 |      Default= dict(n_estimators=100, max_samples=1.0, max_features=1.0, bootstrap=True,
 |                    bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=1, random_state=rand_state, verbose=0)
 |  prediction_train : Array{Float64}
 |      the predicted target values for the training y dataset.
 |  prediction_test : Array{Float64}
 |      the predicted target values for the testing y_test dataset.
 |  model : Scikit learn model
 |      A Scikit Learn object model, see scikit learn library documentation.
 |  X_scaler :
 |      A Scikit Learn scaler object for the x values.
 |  Y_scaler :
 |      A Scikit Learn scaler object for the y values.
 |  
 |  Example
 |  -------
 |  
 |  Given an array X of n samples by m frequencies, and Y an array of n x 1 concentrations
 |  
 |  >>> model = rampy.mlregressor(X,y)
 |  >>>
 |  
 |  Remarks
 |  -------
 |  
 |  For details on hyperparameters of each algorithms, please directly consult the documentation of SciKit Learn at:
 |  
 |  http://scikit-learn.org/stable/
 |  
 |  For Support Vector and Kernel Ridge regressions, mlregressor performs a cross_validation search with using 5 KFold cross validators.
 |  
 |  If the results are poor with Support Vector and Kernel Ridge regressions, you will have to tune the param_grid_kr or param_grid_svm dictionnary that records the hyperparameter space to investigate during the cross validation.
 |  
 |  Results for machine learning algorithms can vary from run to run. A way to solve that is to fix the random_state.
 |  For neural nets, results from multiple neural nets (bagging technique) may also generalise better, such that
 |  it may be better to use the BaggingNeuralNet function.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, x, y, **kwargs)
 |      Parameters
 |      ----------
 |      x : array{Float64}
 |          the spectra organised in rows (1 row = one spectrum). The spectra should share a common X axis.
 |      y : Array{Float64}
 |          Target. Only a single target is possible for now.
 |  
 |  fit(self)
 |      Scale data and train the model with the indicated algorithm.
 |      
 |      Do not forget to tune the hyperparameters.
 |      
 |      Parameters
 |      ----------
 |      algorithm : String,
 |          "KernelRidge", "SVM", "LinearRegression", "Lasso", "ElasticNet", "NeuralNet", "BaggingNeuralNet", default = "SVM"
 |  
 |  predict(self, X)
 |      Predict using the model.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape = (n_samples,)
 |          Returns predicted values.
 |      
 |      Remark
 |      ------
 |      if self.scaling == "yes", scaling will be performed on the input X.
 |  
 |  refit(self)
 |      Re-train a model previously trained with fit()
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Choice of the algorithm

Many popular algorithms are possible: KernelRidge, SVM, LinearRegression, Lasso, ElasticNet, NeuralNet, BaggingNeuralNet.

LinearRegression, Lasso and ElasticNet are borned to linear problems, while KernelRidge, SVM, NeuralNet or BaggingNeuralNet can do both linear and non-linear problems. See the documentation of scikit-learn for further details, and all the articles online on those techniques. Those techniques need hyperparameters that are provided as a dictionary to rampy.mlregressor. I strongly encourage you to read the documentation on scikit-learn for the technique you want to use, in order to set those hyperparameters in the good range.

In the present case, we are dealing with a linear problem. We thus can use the Lasso or ElasticNet algorithms.

Machine learning algorithms need a training dataset, and their performance is evaluated on a "testing" dataset that is unseen of the algorithm. rp.mlregressor can split your dataset automatically, or you can choose to provide the two different datasets. Parameters for each algorithm are provided as a dictionary, see the documentation of scikit-learn for further details on them. The datasets are also standardized/normalised inside the function (if you don't know what this means, click here). The type of scaling is changed by the scaler option.

rp.mlregressor creates an object that possess many attributes that are set by default but can be tweaked as preferred. The object also possess two methods: fit and predict, similar to the scikit-learn API. However, contrary to scikit-learn and due to the aim of the rampy.mlregressor class, the dataset is loaded upon initialisation of the object. Fit is performed at a second stage as one can switch easily between different algorithms.

The predict function works as that of a scikit-learn model, predicting new values from a new X dataset. Scaling is performed if rampy.mlregressor.scaling is set to True.

For now, we are going to load our data and create our mlregressor object:


In [5]:
model = rp.mlregressor(Obs,C_true[:,0].reshape(-1,1))

This default implementation takes care of splitting the dataset in training and testing subsets in a ratio 70/30, then scaled it with a standard scaler. The test size can be seen (and then set) as:


In [6]:
model.test_sz


Out[6]:
0.3

The scaler is accessed as


In [7]:
model.scaler


Out[7]:
'MinMaxScaler'

Is scaling active?


In [8]:
model.scaling


Out[8]:
True

Now we are going to fit the data switching between the available algorithms in a loop. We use the default dictionnaries for hyperparameters, which will be optimised by gridsearch for most of the algorithms. This should be further tuned in a real-life application.


In [9]:
for i in ["KernelRidge", "SVM", "LinearRegression", "NeuralNet", "BaggingNeuralNet"]:
    model.algorithm = i
    model.user_kernel = 'poly'
    model.fit()
    plt.figure()
    plt.title(model.algorithm)
    plt.plot(model.y_test,model.prediction_test,"r.",label="Test")
    plt.plot(model.y_train,model.prediction_train,"k.",label="Train")
    plt.xlabel("Y observed")
    plt.ylabel("Y predicted")
    plt.plot([0,1],[0,1],"k--",label="1:1 line")
    plt.legend()


The support vector machine algorithm did not work well. We can try using a linear kernel, as the default one is set to radial basis function:


In [10]:
model.user_kernel


Out[10]:
'poly'

and the hyperparameters are


In [11]:
model.param_svm


Out[11]:
{'C': [1.0,
  2.0,
  5.0,
  10.0,
  50.0,
  100.0,
  500.0,
  1000.0,
  5000.0,
  10000.0,
  50000.0,
  100000.0],
 'gamma': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
        1.e+04])}

We can update it, and even provide other parameters if looking at sklearn help for SVM regressor.


In [12]:
model.param_grid_svm = dict(C= np.logspace(-5,5,40), gamma= np.logspace(-5,5,40))

We also update the kernel...


In [13]:
model.user_kernel = 'linear'

And run the training again... This will take longer as we explore a larger hyperparameter space...


In [14]:
model.algorithm = "SVM"
model.fit()

plt.figure()
plt.title(model.algorithm)
plt.plot(model.y_test,model.prediction_test,"r.",label="Test")
plt.plot(model.y_train,model.prediction_train,"k.",label="Train")
plt.plot([0,1],[0,1],"k--",label="1:1 line")
plt.xlabel("Y observed")
plt.ylabel("Y predicted")
plt.legend()


Out[14]:
<matplotlib.legend.Legend at 0x7fc6031d5ba8>

We can test to refit the data with a different kernel


In [15]:
model.user_kernel = 'poly'
model.refit() # refit avoid running again model declaration and data standardisation.

plt.figure()
plt.title(model.algorithm)
plt.plot(model.y_test,model.prediction_test,"r.",label="Test")
plt.plot(model.y_train,model.prediction_train,"k.",label="Train")
plt.plot([0,1],[0,1],"k--",label="1:1 line")
plt.xlabel("Y observed")
plt.ylabel("Y predicted")
plt.legend()


Out[15]:
<matplotlib.legend.Legend at 0x7fc603024be0>

Not really better. Maybe SVM is not the best technique for this problem, or we have a problem in the hyperparameters...

Let's try other algorithms, like neural nets and their ensemble version.

Neural Nets

We see above that our ensemble method that trains 100 neural nets and get an estimate from their mean. It seems to work well. We can play with the hyperparameters.

First, we now train 1000 networks, not 100, by setting n_estimators to 1000.

We also tune the architecture of the network, by putting 10 activation functions in a single hidden layer.


In [16]:
model.algorithm = "BaggingNeuralNet"
model.param_bagging = dict(n_estimators=1000, max_samples=100, max_features=len(x), bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=2, verbose=0)

print("Shape of the hidden layers:"+str(model.param_neurons['hidden_layer_sizes']))
print("Activation functions in the hidden layers:"+str(model.param_neurons['activation']))

model.fit()
plt.figure()
plt.title(model.algorithm)
plt.plot(model.y_test,model.prediction_test,"r.",label="Test")
plt.plot(model.y_train,model.prediction_train,"k.",label="Train")
plt.plot([0,1],[0,1],"k--",label="1:1 line")
plt.xlabel("Y observed")
plt.ylabel("Y predicted")
plt.legend()


Shape of the hidden layers:(3,)
Activation functions in the hidden layers:relu
Out[16]:
<matplotlib.legend.Legend at 0x7fc602f45ba8>

The networks perform not very well with this setup... The BaggingRegressor uses a base model, which actually is the MLPregressor used when setting rampy.mlregressor.algorithm = "NeuralNets". It is possible to tweak the hyperparameters of the base regressor to try improving the fit.

We see above that we have 1 hidden layer with 3 RELU units. Let's have 3 layers with 10 units in each layer. This is a deeper network that can work well on complex problems. We increase the number of hidden layers (in a tuple), as described in scikit-learn MLPregressor help:


In [17]:
model.param_neurons['hidden_layer_sizes'] = (10,) 
model.param_neurons['activation'] = "relu" # we also could try changing the activation to tanh. Try it!

In [18]:
model.algorithm = "BaggingNeuralNet"
print("Shape of the hidden layers:"+str(model.param_neurons['hidden_layer_sizes']))
print("Activation functions in the hidden layers:"+str(model.param_neurons['activation']))
model.fit()
plt.figure()
plt.title(model.algorithm)
plt.plot(model.y_test,model.prediction_test,"r.",label="Test")
plt.plot(model.y_train,model.prediction_train,"k.",label="Train")
plt.plot([0,1],[0,1],"k--",label="1:1 line")
plt.xlabel("Y observed")
plt.ylabel("Y predicted")
plt.legend()


Shape of the hidden layers:(10,)
Activation functions in the hidden layers:relu
Out[18]:
<matplotlib.legend.Legend at 0x7fc602e75da0>

Predicting new values from this model

Now we have made new observations, and we want to predict C given D. We can do that easily:


In [19]:
print("generating 10 new observations")
C_new_ = np.random.rand(10) #10 samples with random concentrations between 0 and 1
C_new_true = np.vstack((C_new_,(1-C_new_))).T
print("Shape of concentration matrix:"+str(C_new_true.shape))

noise_new = np.random.randn(len(x))*1e-4
Obs_new = np.dot(C_new_true,S_true) + noise_new

plt.plot(x,Obs_new.T)
plt.xlabel('X')
plt.ylabel('Y')
plt.title("Our new observations in the same chemical system")
plt.show()


generating 10 new observations
Shape of concentration matrix:(10, 2)

Predictions!


In [20]:
C_new_predicted = model.predict(Obs_new)

Comparison!


In [21]:
plt.figure()
plt.title("New predictions with algorithm "+model.algorithm)
plt.plot(C_new_,C_new_predicted,"ko")
plt.plot([0,1],[0,1],"k--",label="1:1 line")
plt.xlabel("Y observed")
plt.ylabel("Y predicted")


Out[21]:
Text(0, 0.5, 'Y predicted')

RMSE!


In [22]:
from sklearn.metrics import mean_squared_error as mse

print("mean RMSE between newly predicted and observed data:" + str(np.sqrt(mse(C_new_,C_new_predicted))))


mean RMSE between newly predicted and observed data:0.007213124565690301

In [ ]: